We will be investigating the same dataset as in HW1: spambase.csv from the OpenML-100 suite. This dataset concerns emails, of which about 39% were classified as spam; the rest were work and personal emails. We will train a few Random Forest classifiers and an XGBoost model and compare their permutation-based variable importances. We will also look at and compare other types of variable importance, namely TreeSHAP and scikit-learn's built-in feature importance (mean decrease in impurity).
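As a reminder of what permutation-based importance measures: a variable's importance is the increase in loss after that variable's values are randomly shuffled, which breaks its relationship with the target. A minimal sketch on synthetic data (not the spambase set), using scikit-learn's `permutation_importance` as a stand-in for the dalex `model_parts` machinery used below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data standing in for spambase (assumption: shapes only, not real features)
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column in turn, n_repeats times, and record the mean drop in score
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

The same idea underlies dalex's `dropout_loss` column: features whose shuffling hurts the model most rank highest.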
Below is a table showing the top variable importances of our original RF model (max_depth = 8, max_features = 0.3), using scikit-learn's built-in feature importances (mean decrease in impurity):
| Variable | Importance |
|---|---|
| char_freq_%21 | 0.191269 |
| char_freq_%24 | 0.171319 |
| word_freq_remove | 0.134845 |
| word_freq_free | 0.082156 |
| capital_run_length_average | 0.073821 |
| word_freq_your | 0.058051 |
| word_freq_hp | 0.052018 |
| word_freq_money | 0.036021 |
| word_freq_our | 0.024775 |
| word_freq_000 | 0.019361 |
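For reference, a table like the one above can be rebuilt from any fitted forest's `feature_importances_` attribute and sorted. A self-contained sketch on toy data (the feature names here are placeholders, not spambase variables):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for spambase; same hyperparameters as the model above
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=50, max_depth=8,
                            max_features=0.3, random_state=1).fit(X, y)

# MDI importances are normalized to sum to 1 across features
imp = pd.DataFrame({"Variable": [f"f{i}" for i in range(6)],
                    "Importance": rf.feature_importances_})
top = imp.sort_values("Importance", ascending=False).head(10)
```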
Below is a chart showing the variable importances of our original RF model (max_depth = 8, max_features = 0.3), but using TreeSHAP:

import numpy as np
import pandas as pd
import dalex as dx
import lime
spambase = pd.read_csv("spambase.csv")
df = spambase.drop(spambase.columns[0], axis=1)  # drop the first column, which is just a row index
df.describe()
| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_table | word_freq_conference | char_freq_%3B | char_freq_%28 | char_freq_%5B | char_freq_%21 | char_freq_%24 | char_freq_%23 | capital_run_length_average | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | ... | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.005444 | 0.031869 | 0.038575 | 0.139030 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.076274 | 0.285735 | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 0.488698 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.588000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.065000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.276000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 0.420000 | 0.000000 | 0.380000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.160000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.188000 | 0.000000 | 0.315000 | 0.052000 | 0.000000 | 3.706000 | 1.000000 |
| max | 4.540000 | 14.280000 | 5.100000 | 42.810000 | 10.000000 | 5.880000 | 7.270000 | 11.110000 | 5.260000 | 18.180000 | ... | 2.170000 | 10.000000 | 4.385000 | 9.752000 | 4.081000 | 32.478000 | 6.003000 | 19.829000 | 1102.500000 | 1.000000 |
8 rows × 56 columns
X = df.loc[:, df.columns != 'TARGET']
y = df['TARGET']  # 1-D target avoids sklearn's DataConversionWarning when fitting
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
RF_final = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.3, random_state = 1).fit(X_train, y_train)
print("Train accuracy: ", accuracy_score(y_train, RF_final.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, RF_final.predict(X_test)))
Train accuracy:  0.9553140096618358
Test accuracy:  0.9609544468546638
RFexplainer = dx.Explainer(RF_final, X_test, y_test)
RFexplainer.model_performance()
Preparation of a new explainer is initiated
  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000137E7081940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0169, mean = 0.416, max = 0.991
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.671, mean = -0.00337, max = 0.859
  -> model_info        : package sklearn
A new explainer has been created!
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| RandomForestClassifier | 0.926316 | 0.977778 | 0.951351 | 0.960954 | 0.993397 |
pvi = RFexplainer.model_parts(random_state=0)
pvi.result
| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | word_freq_report | 0.006545 | RandomForestClassifier |
| 1 | char_freq_%23 | 0.006562 | RandomForestClassifier |
| 2 | word_freq_data | 0.006586 | RandomForestClassifier |
| 3 | word_freq_conference | 0.006588 | RandomForestClassifier |
| 4 | word_freq_85 | 0.006592 | RandomForestClassifier |
| 5 | word_freq_cs | 0.006597 | RandomForestClassifier |
| 6 | char_freq_%5B | 0.006599 | RandomForestClassifier |
| 7 | word_freq_table | 0.006599 | RandomForestClassifier |
| 8 | word_freq_receive | 0.006601 | RandomForestClassifier |
| 9 | word_freq_857 | 0.006603 | RandomForestClassifier |
| 10 | _full_model_ | 0.006603 | RandomForestClassifier |
| 11 | word_freq_telnet | 0.006611 | RandomForestClassifier |
| 12 | word_freq_direct | 0.006613 | RandomForestClassifier |
| 13 | word_freq_all | 0.006617 | RandomForestClassifier |
| 14 | word_freq_3d | 0.006617 | RandomForestClassifier |
| 15 | word_freq_415 | 0.006621 | RandomForestClassifier |
| 16 | word_freq_original | 0.006623 | RandomForestClassifier |
| 17 | word_freq_pm | 0.006625 | RandomForestClassifier |
| 18 | word_freq_credit | 0.006627 | RandomForestClassifier |
| 19 | word_freq_project | 0.006630 | RandomForestClassifier |
| 20 | word_freq_addresses | 0.006638 | RandomForestClassifier |
| 21 | word_freq_make | 0.006640 | RandomForestClassifier |
| 22 | word_freq_labs | 0.006648 | RandomForestClassifier |
| 23 | word_freq_parts | 0.006658 | RandomForestClassifier |
| 24 | char_freq_%28 | 0.006660 | RandomForestClassifier |
| 25 | word_freq_over | 0.006665 | RandomForestClassifier |
| 26 | word_freq_address | 0.006687 | RandomForestClassifier |
| 27 | word_freq_lab | 0.006691 | RandomForestClassifier |
| 28 | word_freq_mail | 0.006698 | RandomForestClassifier |
| 29 | word_freq_technology | 0.006708 | RandomForestClassifier |
| 30 | word_freq_email | 0.006724 | RandomForestClassifier |
| 31 | word_freq_people | 0.006729 | RandomForestClassifier |
| 32 | word_freq_order | 0.006733 | RandomForestClassifier |
| 33 | word_freq_1999 | 0.006743 | RandomForestClassifier |
| 34 | word_freq_internet | 0.006764 | RandomForestClassifier |
| 35 | char_freq_%3B | 0.006799 | RandomForestClassifier |
| 36 | word_freq_will | 0.006825 | RandomForestClassifier |
| 37 | word_freq_re | 0.006850 | RandomForestClassifier |
| 38 | word_freq_hpl | 0.006918 | RandomForestClassifier |
| 39 | word_freq_meeting | 0.006937 | RandomForestClassifier |
| 40 | word_freq_you | 0.006999 | RandomForestClassifier |
| 41 | word_freq_650 | 0.007009 | RandomForestClassifier |
| 42 | word_freq_our | 0.007225 | RandomForestClassifier |
| 43 | word_freq_font | 0.007398 | RandomForestClassifier |
| 44 | word_freq_business | 0.007625 | RandomForestClassifier |
| 45 | word_freq_your | 0.008427 | RandomForestClassifier |
| 46 | word_freq_000 | 0.008924 | RandomForestClassifier |
| 47 | word_freq_money | 0.009429 | RandomForestClassifier |
| 48 | word_freq_edu | 0.009889 | RandomForestClassifier |
| 49 | word_freq_george | 0.010276 | RandomForestClassifier |
| 50 | word_freq_free | 0.012529 | RandomForestClassifier |
| 51 | capital_run_length_average | 0.012735 | RandomForestClassifier |
| 52 | word_freq_hp | 0.013267 | RandomForestClassifier |
| 53 | char_freq_%24 | 0.018916 | RandomForestClassifier |
| 54 | word_freq_remove | 0.025955 | RandomForestClassifier |
| 55 | char_freq_%21 | 0.026308 | RandomForestClassifier |
| 56 | _baseline_ | 0.489602 | RandomForestClassifier |
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
featureimp = pd.DataFrame(data = {"Variable": RF_final.feature_names_in_, "Importance": RF_final.feature_importances_})
featureimp
| | Variable | Importance |
|---|---|---|
| 0 | word_freq_make | 0.001165 |
| 1 | word_freq_address | 0.001765 |
| 2 | word_freq_all | 0.002796 |
| 3 | word_freq_3d | 0.000314 |
| 4 | word_freq_our | 0.024775 |
| 5 | word_freq_over | 0.003399 |
| 6 | word_freq_remove | 0.134845 |
| 7 | word_freq_internet | 0.007125 |
| 8 | word_freq_order | 0.002147 |
| 9 | word_freq_mail | 0.002673 |
| 10 | word_freq_receive | 0.004296 |
| 11 | word_freq_will | 0.005077 |
| 12 | word_freq_people | 0.001294 |
| 13 | word_freq_report | 0.000896 |
| 14 | word_freq_addresses | 0.000538 |
| 15 | word_freq_free | 0.082156 |
| 16 | word_freq_business | 0.007490 |
| 17 | word_freq_email | 0.003160 |
| 18 | word_freq_you | 0.011135 |
| 19 | word_freq_credit | 0.002346 |
| 20 | word_freq_your | 0.058051 |
| 21 | word_freq_font | 0.001764 |
| 22 | word_freq_000 | 0.019361 |
| 23 | word_freq_money | 0.036021 |
| 24 | word_freq_hp | 0.052018 |
| 25 | word_freq_hpl | 0.011815 |
| 26 | word_freq_george | 0.019223 |
| 27 | word_freq_650 | 0.002613 |
| 28 | word_freq_lab | 0.000827 |
| 29 | word_freq_labs | 0.001923 |
| 30 | word_freq_telnet | 0.001212 |
| 31 | word_freq_857 | 0.000208 |
| 32 | word_freq_data | 0.001240 |
| 33 | word_freq_415 | 0.000344 |
| 34 | word_freq_85 | 0.001625 |
| 35 | word_freq_technology | 0.001655 |
| 36 | word_freq_1999 | 0.006814 |
| 37 | word_freq_parts | 0.000221 |
| 38 | word_freq_pm | 0.002059 |
| 39 | word_freq_direct | 0.000368 |
| 40 | word_freq_cs | 0.000530 |
| 41 | word_freq_meeting | 0.007260 |
| 42 | word_freq_original | 0.000916 |
| 43 | word_freq_project | 0.001113 |
| 44 | word_freq_re | 0.006195 |
| 45 | word_freq_edu | 0.017709 |
| 46 | word_freq_table | 0.000126 |
| 47 | word_freq_conference | 0.000787 |
| 48 | char_freq_%3B | 0.001909 |
| 49 | char_freq_%28 | 0.006084 |
| 50 | char_freq_%5B | 0.000869 |
| 51 | char_freq_%21 | 0.191269 |
| 52 | char_freq_%24 | 0.171319 |
| 53 | char_freq_%23 | 0.001338 |
| 54 | capital_run_length_average | 0.073821 |
import shap
shapExplainer = shap.TreeExplainer(RF_final)
explanation = shapExplainer(X_test)
shap_values = explanation.values
shap_values.shape
(461, 55, 2)
shap.plots.beeswarm(explanation[:,:,0])
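Beyond the beeswarm, the per-observation SHAP array can be collapsed into a single global ranking by averaging absolute contributions per feature, which makes it directly comparable to the permutation and MDI tables. A sketch on a random stand-in array with the same `(461, 55, 2)` layout as `shap_values` above (assumption: the real values would come from `explanation.values`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in with the same layout as shap_values: (rows, features, classes)
shap_values = rng.normal(size=(461, 55, 2))

# Global importance for the spam class (index 1):
# mean absolute SHAP value per feature across all test rows
global_imp = np.abs(shap_values[:, :, 1]).mean(axis=0)
order = np.argsort(global_imp)[::-1]  # most important feature first
```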
RF_2 = RandomForestClassifier(n_estimators=200, max_depth = 3, max_features = 0.3, random_state = 1).fit(X_train, y_train)
print("Train accuracy: ", accuracy_score(y_train, RF_2.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, RF_2.predict(X_test)))
RF2explainer = dx.Explainer(RF_2, X_test, y_test)
RF2explainer.model_performance()
Preparation of a new explainer is initiated
  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000137E7081940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0888, mean = 0.412, max = 0.945
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.65, mean = 7.8e-05, max = 0.855
  -> model_info        : package sklearn
A new explainer has been created!
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| RandomForestClassifier | 0.884211 | 0.976744 | 0.928177 | 0.943601 | 0.983842 |
pvi2 = RF2explainer.model_parts(random_state=0)
pvi2.plot(show=False).update_layout(autosize=False, width=600, height=450)
RF_3 = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.1, random_state = 1).fit(X_train, y_train)
print("Train accuracy: ", accuracy_score(y_train, RF_3.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, RF_3.predict(X_test)))
RF3explainer = dx.Explainer(RF_3, X_test, y_test)
RF3explainer.model_performance()
Preparation of a new explainer is initiated
  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000137E7081940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00852, mean = 0.415, max = 0.996
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.637, mean = -0.003, max = 0.778
  -> model_info        : package sklearn
A new explainer has been created!
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| RandomForestClassifier | 0.894737 | 0.982659 | 0.936639 | 0.950108 | 0.992795 |
pvi3 = RF3explainer.model_parts(random_state=0)
pvi3.plot(show=False).update_layout(autosize=False, width=600, height=450)
import xgboost
model = xgboost.XGBClassifier(
    n_estimators=50,
    max_depth=2,
    eval_metric="logloss",
    enable_categorical=True,
    tree_method="hist"
)
model.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=True, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=50, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...)

def pf_xgboost_classifier_categorical(model, df):
    # Cast object columns to pandas 'category' dtype so that
    # XGBoost's enable_categorical path accepts the frame
    df.loc[:, df.dtypes == 'object'] = \
        df.select_dtypes(['object']) \
          .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]
XGexplainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated
  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x00000137ECA9CE50> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 7.59e-05, mean = 0.42, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.793, mean = -0.00801, max = 0.759
  -> model_info        : package xgboost
A new explainer has been created!
XGpvi = XGexplainer.model_parts(random_state=0)
XGpvi.result
| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | word_freq_lab | 0.003849 | XGBClassifier |
| 1 | word_freq_technology | 0.003981 | XGBClassifier |
| 2 | word_freq_report | 0.003997 | XGBClassifier |
| 3 | word_freq_857 | 0.004001 | XGBClassifier |
| 4 | word_freq_email | 0.004001 | XGBClassifier |
| 5 | word_freq_direct | 0.004001 | XGBClassifier |
| 6 | word_freq_data | 0.004001 | XGBClassifier |
| 7 | word_freq_people | 0.004001 | XGBClassifier |
| 8 | word_freq_mail | 0.004001 | XGBClassifier |
| 9 | word_freq_make | 0.004001 | XGBClassifier |
| 10 | word_freq_all | 0.004001 | XGBClassifier |
| 11 | word_freq_addresses | 0.004001 | XGBClassifier |
| 12 | word_freq_address | 0.004001 | XGBClassifier |
| 13 | word_freq_font | 0.004001 | XGBClassifier |
| 14 | word_freq_85 | 0.004001 | XGBClassifier |
| 15 | word_freq_415 | 0.004001 | XGBClassifier |
| 16 | word_freq_3d | 0.004001 | XGBClassifier |
| 17 | word_freq_receive | 0.004001 | XGBClassifier |
| 18 | word_freq_original | 0.004001 | XGBClassifier |
| 19 | char_freq_%5B | 0.004001 | XGBClassifier |
| 20 | char_freq_%3B | 0.004001 | XGBClassifier |
| 21 | word_freq_telnet | 0.004001 | XGBClassifier |
| 22 | word_freq_parts | 0.004001 | XGBClassifier |
| 23 | _full_model_ | 0.004001 | XGBClassifier |
| 24 | word_freq_table | 0.004001 | XGBClassifier |
| 25 | word_freq_labs | 0.004001 | XGBClassifier |
| 26 | word_freq_credit | 0.004003 | XGBClassifier |
| 27 | char_freq_%23 | 0.004018 | XGBClassifier |
| 28 | word_freq_project | 0.004024 | XGBClassifier |
| 29 | word_freq_cs | 0.004034 | XGBClassifier |
| 30 | word_freq_order | 0.004051 | XGBClassifier |
| 31 | word_freq_hpl | 0.004067 | XGBClassifier |
| 32 | word_freq_pm | 0.004073 | XGBClassifier |
| 33 | word_freq_conference | 0.004172 | XGBClassifier |
| 34 | char_freq_%28 | 0.004211 | XGBClassifier |
| 35 | word_freq_1999 | 0.004263 | XGBClassifier |
| 36 | word_freq_will | 0.004294 | XGBClassifier |
| 37 | word_freq_over | 0.004333 | XGBClassifier |
| 38 | word_freq_your | 0.004403 | XGBClassifier |
| 39 | word_freq_internet | 0.004479 | XGBClassifier |
| 40 | word_freq_000 | 0.004512 | XGBClassifier |
| 41 | word_freq_you | 0.004682 | XGBClassifier |
| 42 | word_freq_money | 0.004747 | XGBClassifier |
| 43 | word_freq_business | 0.004748 | XGBClassifier |
| 44 | word_freq_re | 0.004954 | XGBClassifier |
| 45 | word_freq_meeting | 0.005331 | XGBClassifier |
| 46 | word_freq_free | 0.005690 | XGBClassifier |
| 47 | word_freq_our | 0.005928 | XGBClassifier |
| 48 | word_freq_650 | 0.006205 | XGBClassifier |
| 49 | word_freq_edu | 0.007174 | XGBClassifier |
| 50 | capital_run_length_average | 0.009688 | XGBClassifier |
| 51 | char_freq_%24 | 0.010285 | XGBClassifier |
| 52 | char_freq_%21 | 0.011817 | XGBClassifier |
| 53 | word_freq_remove | 0.012744 | XGBClassifier |
| 54 | word_freq_george | 0.017075 | XGBClassifier |
| 55 | word_freq_hp | 0.022303 | XGBClassifier |
| 56 | _baseline_ | 0.492079 | XGBClassifier |
XGpvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
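One way to quantify how much the RF and XGBoost permutation rankings agree is rank correlation. A sketch on stand-in importance vectors (assumption: in practice these would be the `dropout_loss` columns of `pvi.result` and `XGpvi.result`, aligned on the shared `variable` names with the `_full_model_` and `_baseline_` rows dropped):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
# Stand-in importance vectors for the 55 predictors
rf_imp = rng.random(55)
xgb_imp = 0.5 * rf_imp + 0.5 * rng.random(55)  # partially agreeing ranking

# Spearman's rho compares the orderings, not the raw magnitudes,
# which is appropriate since dropout_loss scales differ between models
rho, pval = spearmanr(rf_imp, xgb_imp)
```

A rho near 1 would mean the two models rely on the variables in nearly the same order; values near 0 would indicate largely unrelated rankings.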